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Abstract 

,In this paper, we develop a highly efficient molecular dynamics code fully implemented on graphics processing units for thermal 
conductivity calculations using the Green-Kubo formula. We compare two different schemes for force evaluation, a previously used 
thread-scheme where a single thread is used for one particle and each thread calculates the total force for the corresponding particle, 
and a new block-scheme where a whole block is used for one particle and each thread in the block calculates one or several pair 
forces between the particle associated with the given block and its neighbor particle(s) associated with the given thread. For both 
schemes, two different classical potentials, namely, the Lennard- Jones potential and the rigid-ion potential are implemented. While 
,the thread-scheme performs a little better for relatively large systems, the block-scheme performs much better for relatively small 
systems. The relative performance of the block-scheme over the thread-scheme also increases with the increasing of the cutoff 
jadius. We validate the implementation by calculating lattice thermal conductivities of solid argon and lead telluride. 
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1. Introduction 

, Recently, graphics processing units (GPU) have attracted 
more and more attention in the computational physics commu- 
|nity due to their large-scale parallelism ability, which has been 
explored in many kinds of computational physics problems, in- 
Icluding gravitational A^-body simulation 1 1]], classical molecu- 
lar dynamics simulation ||2|-[8|], classical Monte Carlo simula- 
Ition ipl, quantum Monte Carlo simulation ifioll . quantum trans- 
port il ill and exact diagonalization il2ll . just to name a few. 
^Despite the achievements obtained so far, the potential of GPU 
to speed up computational physics applications are far from be- 
ing fully explored. 

One of the physical problems which is very demanding of 
high performance computing is molecular dynamics (MD) sim- 
ulation and its GPU acceleration has been studied by several 
groups 13-01 ■ As early as in 2006, Yang et al |01 carried out 
a proof-of-principle study on the acceleration of equilibrium 
molecular dynamics simulation for thermal conductivity pre- 
diction of solid argon based on the Green-Kubo formula using 
a GPU. For this purpose, there is no need to update the neigh- 
bor list since every particle in the system just oscillates around 
its equilibrate position. Using a very old GPU, they obtained a 
speedup of about 10 relative to a single CPU, although their im- 
plementation suffers from accuracy loss, which results in a heat 
current autocorrelation function (HCACF) not decaying to zero 
but a finite value. After this pioneering work, a few other groups 
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Studied the GPU-acceleration of more general MD simulations, 
mainly oriented to large-scale simulations in biochemical sys- 
tems, and the problem of neighbor list update attracts special 
considerations Ht] ■ The reported speedups over CPU im- 
plementations range from one to two orders of magnitude, de- 
pending on the interaction potential, the simulation size and the 
cutoff radius used in the simulations. 

In this work, we consider the development and optimization 
of a full GPU implementation of equilibrium MD simulations 
and use it for thermal conductivity predictions using the Green- 
Kubo formula. The main difference between our work and the 
previous ones Q-Sl is that we are more interested in relatively 
small systems, since size effects for the Green-Kubo method 
of thermal conductivity prediction are not very significant, and 
they can often be eliminated with less than several thousand 
particles. The demand for high performance computing for the 
Green-Kubo method comes from the fact that the number of 
simulation steps usually needs to be very large (sometimes as 
large as 10** steps 11311 ) to ensure a well converged HCACF and 
the corresponding running thermal conductivity (RTC). Around 
this simulation size, we find that the conventional approach of 
force evaluation where a single thread is assigned to evaluate 
the total force on one particle Isfj, can not utilize the compu- 
tational power of modern GPUs sufficiently. Here, we pro- 
pose another approach of force evaluation, where one block of 
threads is assigned to evaluate the total force on one particle, 
which exhibits significantly superior performance for small-to- 
intermediate simulation sizes. In the one thread per particle 
scheme (which will be called the thread-scheme), the number of 
invoked blocks in the force evaluation kernel, which equals the 
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number of particles divided by the block size, is much smaller 
than that in the one block per particle scheme (which will be 
called the block-scheme), where it equals the number of parti- 
cles. We will demonstrate that the block-scheme can attain its 
optimal performance for relatively samll systems and is several 
times faster than the thread-scheme for these. Compared with 
the proof-of-principle study of Yang et al iQl, which only cal- 
culates the lattice thermal conductivity of solid argon using the 
simple Lennard- Jones (LJ) potential, we also consider the rigid- 
ion (RI) potential, which is more computational demanding. On 
average, two orders of magnitude speedups can be obtained for 
the evolution part of the MD simulation. For all the tested sit- 
uations, the acceleration rates for the RI potential are several 
times larger than those for the LJ potential, reflecting the higher 
arithmetic intensity for the RI potential. In addition, neighbor 
list construction and HCACF calculation are also implemented 
in the GPU and good speedups are obtained for these parts too. 

This paper is organized as follows. In Sect. [2] we review the 
techniques of molecular dynamics simulation for thermal con- 
ductivity prediction using the Green-Kubo formula and some 
important features of CUDA relevant to our implementation. In 
Sect. |3] we present the detailed algorithms and implementa- 
tions of our code. In Sect. H) the performances for the different 
force evaluation schemes and the different potentials are shown 
and compared. In Sect. |5] we validate our implementation by 
computing the lattice thermal conductivities of solid argon and 
lead telluride at different temperatures. Our conclusions will be 
presented in Sect. |6] 

2. Backgrounds 

2.1. Green-Kubo method for thermal conductivity calculations 
In atomistic calculations of lattice thermal conductivity, both 
the equilibrium-based and the non-equilibrium-based molec- 
ular dynamics simulations can be employed. The non- 
equilibrium-based method uses the Fourier's law of heat con- 
duction and is referred to as the direct method. The equilibrium- 
based method uses the Green-Kubo linear response theory and 
is referred to as the Green-Kubo method. Schelling et al 11411 
showed that the equilibrium and the nonequiUbrium methods 
give consistent results. In some cases, the former is more ad- 
vantageous; in other cases, the reverse is true. The most im- 
portant advantage of the Green-Kubo method is that it has a 
much less prominent size effect compared to the direct method. 
In the direct method, unless the simulation cell is many times 
longer than the mean-free path, scattering from the heat source 
and heat sink contributes more to the thermal resistivity than the 
intrinsic anharmonic phonon-phonon scattering does. In most 
cases, a value for the thermal conductivity of an infinite sys- 
tem can be reliably obtained by the direct method from sim- 
ulations of different system sizes and an extrapolation to an 
infinit e sy stem size 11411 . However, as pointed out by Sellan 
et al il5n . the Unear extrapolation procedure is only accurate 
when the minimum system size used in the direct method sim- 
ulations is comparable to the largest mean free paths of the 
phonons that dominate the thermal transport. As for the com- 
putational cost, the Green-Kubo method requires a rather long 



simulation time (10^ ~ 10*^ time steps) to get a well converged 
HCACF and thermal conductivity value. To get a well defined 
temperature gradient, the direct method only requires a rela- 
tively short simulation time (~ 10^ time steps). However, the 
Green-Kubo method is still generally computationally cheaper 
since one needs to simulate several very large systems when 
using the direct method to accurately extrapolate to the infinite 
limit. Additionally, by a single simulation, one can obtain the 
full thermal conductivity tensor of the system when using the 
Green-Kubo method, but can only obtain a single component 
of the thermal conductivity when using the direct method. 

According to the Green-Kubo linear response theory lfl6l - 
1811 . the lattice thermal conductivity tensor /c^v can be expressed 
as a time integral of the HCACF Cftvit), 
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where t is the correlation time, ks the Boltzmann's constant, 
V the volume and T the temperature. The HCACF C^v{t) - 
{J^{0)Jv{t)) is computed in the MD simulation by an average 
over different time origins. 
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where 5t is the time step and M is the number of time origins 
to be averaged, which is approximately the number of produc- 
tion steps if the number of correlation steps is much less than 
the number of production steps. For isotropic 3D materials, the 
thermal conductivity scalar is usually defined to be the average 
of the diagonal elements, {k^x + Kyy + k^z)/^- Similarly, for a ma- 
terial isotropic in a plane, the thermal conductivity in that plane 
can be defined to be (/c„ + Kyy)/2. Generally, the Green-Kubo 
method is able to produce the full tensor of thermal conductiv- 
ity in a single run. We will use C to denote the vector with 
components Cvi, Cw and C-,. 

The heat current vector J (with components J^, Jy and J-) is 
defined to be the time derivative of the sum of the moments of 



the site energies of the particles in the system fllSIl 
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where the site energy is the sum of the kinetic energy and the 
potential energy, £, - ^mj\r + {/,. For a two-body potential, 
U, = i Y.J Uij and = 2,. Fij = 2,. (-|-) C/,,, we have 



where 
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with r,j = rj - r,- and v,j = \j + v,-. 
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In this paper, we consider two kinds of pair-wise potentials. 
The first is the LJ potential of the form 
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where e and cr are two parameters of the model, the dimensions 
of which are [energy] and [length] respectively. For argon, 
e/kB - 1 19.8 K and cr - 3.405 A, where ks is the Boltzmann's 
constant. This potential has been used by many groups to cal- 
culate thermal conductivities of both pure solid argon Iisit23]l 
and argon-krypton composites l24l 12511 . The second potential 
we consider is the RI potential which consists of the Coulomb 
potential as well as a short range part in the Buckingham form 
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The presence of the subscripts in the parameters Ajj, pij and 
Cij indicates that the values of the parameters depend on the 
particle types of the interacting pairs. 

The other part of the RI potential is the Coulomb potential 
which is a long range potential, and direct evaluation of it us- 
ing the traditional Ewald summation is very time consuming for 
large systems. Instead, the Wolf method 12611 is more advanta- 
geous both computationally and conceptually. In our work, we 
use a modified form of the Wolf formula developed by Fennell 
et al 12711 . in which both the potential and force are continuous 
at the cutoff radius. 
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where a and Rc are the electrostatic damping factor and the cut- 
off radius, respectively. The RI potential in the Buckingham 
form is used extensively to study lattice thermal conductivities 
of various kinds of semiconductors and insulators such as com- 
plex sihca crystals [Q, ZnO Q and PbTe In this work, 
we only consider PbTe, using the potential parameters devel- 
oped by Qiu et al jsTll . Note that in their parameterizations, 
partial charges are used for both Pb and Te, which are +0.666 
and -0.666, respectively. 

To obtain the lattice thermal conductivity form HCACF, one 
usually first fits the HCACF by a double-exponential function 



, 13211 with two time parameters and obtains the thermal con- 
ductivity by analytical integration of the function. Ideally, the 
HCACF should decay to zero except for some anomalous sys- 
tems with divergent thermal conductivity 13311 . Therefore, alter- 
natively, one can also directly integrate the HCACF and average 
the resulting RTC values in an appropriate range of time block 
1 14j, [151] . We will show that the RTC will converge to a definite 
value as long as the simulation time is large enough to obtain a 
smooth HCACF which decays to zero after a given correlation 
time. 



2.2. Overview of CUDA 

In this subsection, we review some techniques for GPU pro- 
gramming with CUDA. Only the important features relevant to 
our implementation are presented here. For a more thorough 
introduction to CUDA, we refer to the official manual jsill- Al- 
though our discussion is based on Tesla M2070, which is of 
compute capability 2.0, the implementation and optimization 
can be easily ported to other platforms. 

CUDA (Compute Unified Device Architecture) ll34ll is a par- 
allel computing architecture for a hybrid platform of CPU (the 
host) and GPU (the device). A CUDA program consists of 
both common C/C++ code which is compiled and executed on 
the host and special functions called kernels invoked from the 
host and executed on the device. When a kernel is called from 
the host, a grid of blocks, each of which contains individual 
threads, are invoked to execute the instructions of the kernel in 
a single instruction multiple data way. Each individual thread 
has a unique ID which can be specified by built-in variables 
such as threadldx . x and blockldx . x. The sizes of the grid 
and the blocks for a kernel are specified at runtime and can 
also be inferred from built-in variables such as gridDim . x and 
blockDim.x. 

The concept of CUDA is closely connected to the GPU hard- 
ware architecture, the knowledge of which is vital for under- 
standing and optimizing CUDA applications. A GPU consists 
of a number of streaming multiprocessors (SMs) and an SM 
consists of a number of scalar processors (SPs). For our test- 
ing GPU, Tesla M2070, there are 16 SMs and 16 x 32 = 448 
SPs. The maximum numbers of resident blocks and threads 
that can be simultaneously invoked within an SM are limited. 
For GPUs with compute capability 2.x, they are 8 and 1536, 
respectively 13411 . Thus, the theoretically optimal number of 
threads in a block (the block size) is about 1536/8 - 192 for 
Tesla M2070. In this case, the number of active blocks and 
the number of active threads in an SM all reach their optimized 
values. In practice, the optimal block size is also affected by 
the specific problem under consideration. For example, in the 
block-scheme of force evaluation, a block size of 128 is better, 
since 192 is not an exponential of 2, and thus not suitable for 
binary reduction. 

Understanding different GPU memories is also important 
for the implementation and optimization of a CUDA program. 
There are various types of memories in the GPU, each of which 
has its own strengths and limits. The most important memo- 
ries of the GPU are global memory, shared memory, registers, 
constant memory and texture. Global memory plays the cru- 
cial roles of exchanging data between the CPU and the GPU 
and passing data from one kernel to another Global memory 
can be both read and written by each thread in a kernel but is 
very slow, thus should be minimized. Shared memory plays the 
important role of sharing data within a block, which is crucial 
when we do some reduction calculations such as summation. 
Shared memory is fast but expensive. For Tesla M2070, there 
are only 48 KB shared memory for each SM. If we hope to 
keep the maximum number of resident blocks (which is 8 per 
SM), we should not define more than 6 KB shared memory in 
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a kernel. Registers are private read-and-write memories for in- 
dividual threads. The access of registers is very fast, but the 
amount of register memory is rather limited. For Tesla M2070, 
there are only 32 K 32-bit registers for each SM. If we hope to 
keep the maximum number of resident threads (which is 1024 
per SM if we take the block size as 128), we should not define 
more than 32 32-bit registers (or 16 64-bit registers) in a ker- 
nel. Compared with global memory, constant memory is faster 
but its amount is also very limited, only 64 KB. Although some 
data in our application, such as the neighbor list, do not change 
during the entire computation, we cannot use constant memory 
for them since the required amount of memory for the neighbor 
list exceeds the upper bound even for a small system with a few 
hundred particles. Alternatively, textures can be used for this 
kind of data. We only use constant memory for some invariant 
parameters needed in the kernels. 

Lastly, we note that only when the number of blocks is sev- 
eral times larger than the number of SMs to keep the GPU busy 
at all times, can we have the possibility of attaining the peak 
performance of the GPU. This important feature will be dis- 
cussed in detail when we compare the two schemes of force 
evaluation in the following sections. 

3. Implementation details 

In this section, we describe in some detail the techniques of 
the CUDA implementation of our code. 

3.1. The overall consideration 

Firstly, we should find out which part of the CPU code de- 
serves to be rewritten using CUDA. Since force evaluation is 
the most time-consuming part of an MD code, it is expected 
that a CUDA acceleration of this part of the code would result 
in a significant speedup. However, this is not the case due to 
the following two reasons: (1) even if the force evaluation part 
takes up 99% of the computation time, a 100-fold speedup of 
this part would only result in about a 50-fold speedup for the 
whole program; (2) if we only rewrite this part of code using 
CUDA, we have to exchange data between CPU and GPU fre- 
quently, which is very time-consuming. This will result in a 
lower speedup for the whole program. Thus, we should imple- 
ment the whole evolution part of the program in CUDA. 

After confirming this, the next question is whether the whole 
evolution part should be implemented in a single kernel or mul- 
tiple kernels. At first thought, the single kernel approach seems 
to be very appealing, since in this way, global memory access 
can be reduced to its minimal amount. Unfortunately, this is 
not easy to achieve. When we calculate the total force exerted 
on one particle, we need the positions of all of its neighbor 
particles, some of which would reside in different blocks from 
the one where the considered particle resides in. Since there is 
no intrinsic way to synchronize the individual blocks ll34ll . the 
positions of the neighbor particles cannot be guaranteed to be 
completely updated before evaluating the total force of a given 
particle. Therefore, the natural choice is to implement multiple 
kernels for the evolution part, some devoting to the updating of 



positions and velocities and one devoting to the force evalua- 
tion. 

Thus, the overall structure of the whole CUDA program is as 
follows. 

1 . Allocate global memory in the GPU according to the input 
parameters and transfer data from CPU to GPU. 

2. Do the simulation in the GPU. 

(a) Construct the invariant neighbor list. 

(b) Evolve the system according to the interaction poten- 
tial. In the equilibrium stage, control the temperature 
and / or pressure; in the production stage, record the 
heat current data for each time step into global mem- 
ory. 

(c) Calculate the HCACF from the recorded heat current 
data. 

3. Transfer the HCACF data from GPU to CPU and analyze 
the results. 

To facilitate later discussions, we list the main data in the 
global memory of the GPU and the relevant parameters: 

1 . is the number of particles in the simulated system. 

2. Nc is the number of correlation steps. 

3. Np is the number of production steps. 

4. NN; (0 < / < - 1) is the number of neighbor particles 
for particle 

5. NLik {0 < i < N ~ I, < k < NNi - I) is the index of the 
A:-th neighbor particle of particle /. 

6. r,, Vj, F, and J, (0 < ; < - 1) are the position, velocity, 
force and heat current for particle ;. 

7. J(;) (0 < / < A^p - 1) is the total heat current of the system 
at time step /. 

8. C(/) (0 < / < Nc - 1) is the i-th HCACF data. 

We now discuss the details of the implementations of the rel- 
evant kernels in the GPU. 

3.2. The neighbor list construction kernel 

For our purpose, the simple Verlet neighbor list Issll suffices. 
Although in our applications, the neighbor list needs no update 
during the simulation, it is still advantageous to implement it di- 
rectly in the GPU, which can reduce the computation time sig- 
nificantly compared with a direct 0{N^) implementation in the 
CPU. We use a simple algorithm to construct the Verlet neigh- 
bor list as listed in Algorithm [T] This kernel is launched with 
the execution configuration «<\N jS h\ S b»>, where A^ is the 
number of particles and S /, is the block size. When A^ is not an 
integer multiple of Si,, the indices of some threads in the kernel 
would exceed A^. The if statement in line 2 ensures that only 
the valid memory is manipulated. After executing this kernel, 
the number of neighbor particles for particle ; is stored in NN,- 
and the index of the ^-th neighbor particle of particle / is stored 
in NL,vt. 

Note that we have expressed the neighbor list in a matrix 
from Nik, but the actual order of the data in this matrix still 
needs to be specified when we implement it in the computer 
code. There are two natural ways to arrange the data. The first 
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is to store all the indices of the neighbor particles of the i-th 
particle consecutively, i.e., in the order of NLoo, NLoi, NL02, 

• ■ ■ , NLio, NLii, NL12, ■ ■ ■ , NL,o, NL,i, NLq, • • • . The second 
is to store the indices of the k-th neighbor particles for all the 
particles consecutively, i.e., in the order of NLqo, NLio, NL20, 

• ■ ■ , NLoi, NLii, NL21, ■ ■ ■ , NLoA-, NLu, NLjt, ■ ■ ■ . While for 
a serial CPU implementation, the first choice is more prefer- 
able, it turns out that, for our GPU implementation, different 
force evaluation schemes require different forms of the neigh- 
bor fist to ensure coalesced global memory access. This will be 
discussed in more detail when we present the force evaluation 
algorithms. 

Algorithm 1 Pseudo code for Verlet neighbor list construction 

kernel. 

Require: b is the block index 
Require: t is the thread index 
Require: 5 ^ is the block size 
Require: i - ShXb + t 

if i < N then 

for = to - 1 do 
a j ^ i then 



Tij <— minimum image of r^ - r,- 

if |r/jp < square of the cutoff radius then 

k^k+l 
end if 
end if 
end for 

NN/ ^ k 
end if 



3.3. The integration kernels 

The update of positions and velocities for one particle is in- 
dependent of that for another particle. Therefore, they can be 
carried out in parallel. Since this operation is rather simple, we 
can assign the task of updating the velocity and position of a 
particle to a single thread. Thus, the execution configuration of 
any of the integration kernels is the same as that of the neighbor 
list construction kernel. Again, an if statement is necessary to 
ensure that invalid memory is not manipulated. The algorithms 
for these kernels are rather straightforward and not provided 
here. 

3.4. Force evaluation kernel 

Force evaluation is the most time consuming part of most 
MD simulations and thus deserves special consideration. 

3.4.1. The thread-scheme for force evaluation 

Since the calculation of the total force for one particle is in- 
dependent of the calculation of the total force for any other par- 
ticles, the most natural choice for GPU implementation of the 
force evaluation kernel is to assign a single thread to one par- 
ticle and loop for all the neighbor particles of it to accumulate 



the total force exerted on this particle. To our knowledge, pre- 
vious works either follow more or less this strategy when using 
the neighbor list approach or use a cell-list approach for force 
evaluation instead. The pseudo code for this thread-scheme is 
presented in Algorithm |2l The total force and heat current for 
one particle are accumulated (lines 10 and 11) in the for loop 
and saved (line 13) to global memory after exiting the for loop. 

Global memory access is relatively slow compared with the 
access of other memories such as shared memory and registers 
in the GPU, and we have made many efforts to minimize global 
memory access as much as we can, although it is not manifest 
from the presented pseudo codes. For example, in Algorithm 
|2] we have to calculate the distances between one particle and 
all its neighbor particles r,j (0 < j < NN,). While we need to 
read in the positions r^ (0 < j < NN,) from the global memory 
within the for loop, we need not read in the position r, repeat- 
edly within the for loop. Instead, we can read in r, before enter- 
ing the for loop and store it in a register As pointed out earlier, 
registers are fast but their number for each SM is limited, and 
excessive use of them will deteriorate the performance. In the 
case that registers are not enough for use, we can use some 
shared memory substituting for registers. 

When a global memory access is unavoidable, it is impor- 
tant to maximize the coalescing by using the most optimal ac- 
cess pattern possible. Generally, the positions r^ and velocities 
V; within the for loop are accessed in a random pattern, and a 
special particle sorting method has been developed to generate 
better memory access pattern |5|j]. As for the global memory ac- 
cess for the neighbor list as given in line 5, we note that here we 
should choose the neighbor list representation where the indices 
of the ^-th neighbor particles for all the particles are stored con- 
secutively. This makes nearby threads in a warp access nearby 
elements in the neighbor list. 

In the thread-scheme, the execution configuration of the force 
evaluation kernel is the same as that of the neighbor list con- 
struction kernel, for which the number of blocks invoked is 
\NIS i,\ which is not large enough to fully utilize the computa- 
tional resources of a modern GPU for relatively small systems. 
This motivates us to consider the other force evaluation scheme, 
as discussed below. 

3.4.2. The block-scheme for force evaluation 

To increase the number of blocks in the force evaluation ker- 
nel, we notice that for force evaluation, there is a second level 
of parallelism: the calculations of the pair forces between one 
particle and all its neighbor particles are also independent of 
each other. Therefore, instead of accumulating the total fore of 
a particle in a single thread, we can also calculate the pair forces 
between the given particle and all its neighbor particles in dif- 
ferent threads. These threads should be in the same block in 
order to accumulate the total force of the considered particle in 
an efficient way by making use of shared memory. In the sim- 
plest case, one thread in the block only computes a single pair 
force. For example, (block /, thread j) is assigned to calculate 
the force acted on particle / from the y-th neighbor particle of 
In this case, the number of blocks for the force evaluation kernel 
is exactly the number of particles in the system, which is much 
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Algorithm 2 Pseudo code for the force evaluation kernel in the 

thread-scheme. 

Require: b is the block index 
Require: t is the thread index 
Require: 5/, is the block size 
Require: i - S t^b + t 

1: F; ^ 
2: J/ ^ 

3: if / < then 

4: for /t = to NN; - 1 do 

5: ^ NL,i 

6: Tij <— minimum image of rj - r,- 

7: Vy V; + Vj 

8: Calculate the pair force F,y 

9: Calculate the pair heat current J,y 

10: F/ ^ F; + F,-,- 

11: J;^J,+J,7 

12: end for 

13: Save F, and J, to global memory 
14: end if 



larger than the total number of SMs in the GPU for a system 
with intermediate size. We call this the block-scheme for force 
evaluation and the implementation is given in Algorithm |3] 

As in the case of the thread-scheme, the choice of the neigh- 
bor list representation is important. Since in the block-scheme 
nearby threads in a warp evaluate the pair forces between a 
given particle and a portion of its neighbor particles, we should 
choose the neighbor list representation where the indices of the 
neighbor particles of the ;-th particle are stored consecutively. 



Algorithm 3 Pseudo code for the force evaluation kernel in the 

block-scheme. 

Require: ; is the block index 

Require: k is the thread index 

Require: F,y and are in shared memory 

1: F,4 ^O' 
2: J,A ^ 

3: if;t<NN, then 
4: NL,i 

5: r,j <— minimum image of r^ - v, 

6: V,7 ^ V; + V^- 

7: Calculate the pair force F,y 

8: Calculate the pair heat current J,j 

9: end if 

10: synchronize the threads in each block 

1 1 : Binary reductions of F, ^ and J, ^ to F,o and J,o 

12: Save F,o and J,o to global memory F,- and J, 



In the block-scheme, shared memory must be used for 
recording the pair forces F,y and the corresponding heat cur- 
rent components J,j. Otherwise, there is no efficient way to 
sum them up to get the total force and heat current for a given 
particle. Before recording the data, we should initialize all the 
elements of F, , and J, , to be zero, as shown in lines 1 and 2, 



since the number of neighbor particles for some particles would 
probably be less than the block size. After recording the calcu- 
lated data into F,j and J,y, we should synchronize the threads 
in each block before summing them up. The synchronization 
ensures that all the threads have completed their calculations 
before the summation. The summation can be carried out effi- 
ciently in the way of binary reduction, i.e., add the second half 
of data to the first half and repeat the process till we get a single 
number, which is the sum of the whole data. 

The execution configuration for the force evaluation kernel 
in the block-scheme deserves more consideration. Firstly, the 
block size should be an exponential of two, such as 128, in or- 
der to do the binary reductions for F,y and Jy . Secondly, since 
the number of blocks in the kernel equals the number of parti- 
cles, which has the chance of exceeding the maximal value of 
the first dimension of the grid size, which is 65535 for Tesla 
M2070, we need to use a two dimensional grid in such a way 
that the difference between the total number blocks in the grid 
and the number of particles is minimized. 

In the above discussion of the block-scheme, we only consid- 
ered the simplest case, where a whole block of threads are de- 
voted to calculating the total force of a single particle and each 
thread only calculates one pair force. In general, we can have 
two alternatives, depending on the values of the block size S b 
and the maximal number of neighbor particles N„,. If N,„ > S b, 
we should calculate at most IN,„/S b] pair forces in each thread. 
For example, if N„, - 256 and Sb - 128, each thread has to 
calculate two pair forces. In this case, it is important to access 
the global memory in a coalesced way. This requires to use 
thread j to calculate the pair forces associated with the j-th and 
{j + 5'fe)-th neighbor particles, instead of those associated with 
two adjacent neighbor particles. If N,,, < S b/2, we do not need 
to use a whole block for one particle. For example, if N„, - 64 
and S b = 128, we can use one block for two particles, with the 
first 64 threads for the first one and the second 64 threads for 
the second one. 

After the execution of the force evaluation kernel, either in 
the thread-scheme or in the block-scheme, the total forces F,- 
and heat currents J, for the individual particles are stored into 
global memory, and we should, at each time step, sum up the 
heat current values for the individual particles to get the heat 
current for the whole system, J = 2,1/. This can also be done 
efficiently by using the binary reduction. 

3.5. HCACF calculation 

As discussed in the beginning of this section, to obtain a 
high acceleration rate for the whole program, one has to imple- 
ment the whole evolution part in the GPU. Although the post- 
processing part of the program, namely, the part for HCACF 
calculation usually takes up only a fraction of computation time 
of the whole program, it is still advantageous to implement this 
part on the GPU rather than on the CPU, since when we have 
obtained orders of magnitude of speedup for the evolution part, 
the post-processing part would take up a significant portion of 
the whole computation time. 

The GPU implementation of HCACF calculation is very 
straightforward. Since the ensemble averages (which are time 
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averages in MD) of the HCACF at different correlation times 
can be performed independently, we can simply use one block 
for one point of the HCACF data. The pseudo code for the 
HCACF calculation kernel as presented in Algorithm |4] is de- 
signed according to Eq. This kernel is executed with the 
configuration of «<Nc, 5 /,>>>, where Nc is the number of cor- 
relation steps and 5/, is the block size. As in the case of the 
force evaluation kernel in the block-scheme, if the number of 
blocks (which is A^^ here) exceeds the upper bound of the first 
dimension of the grid size (which is 65535 for Tesla M2070), 
a two dimensional grid is needed. After executing this kernel, 
the HCACF data are saved in global memory, which will be 
transferred to CPU for analysis. 

The value of M in Eq. which is the number of time 
origins used to calculate the ensemble averages of the HCACF 
data is chosen to be S hl{Np - Nc)/S i,} for any correlation time. 
This will waste a small portion of heat current data if A^^ -Ncis 
not an integer multiple of Si,. Usually, A^,, is much larger than 
both Ne and S b, and we have M x Np. 



Algorithm 4 Pseudo code for HCACF calculation kernel. 

Require: ; is the block index 

Require: j is the thread index 

Require: 5/, is the block size 

Require: C/y is in shared memory 

1: Cij ^ 

2: for n = to l(Np - Nc)/Sbi - 1 do 

3: Cij ^ Cij + J(j + nS t)i(i + nSt + i) 

4: end for 

5: synchronize the threads in each block 

6: binary reduction of Cij to C,o 

7: Save C,() to global memory C(0 



3.6. Use of texture memory 

Texture memory is a kind of read-only memory that is cached 
on chip. In some situations, it provides higher effective band- 
width, especially in the cases that memory access patterns ex- 
hibit a great deal of spatial locality, i.e., nearby threads are 
likely to read from nearby memory locations. This is the case 
for the force evaluation kernel (both the thread-scheme and the 
block-scheme), where the access of the global memory contain- 
ing the data of the positions r, and velocities v, of the neighbor 
particles for a given particle is not coalesced but exhibits some 
spatial locality. The overall gain of performance by using tex- 
ture memory is about 10%. 

3.7. Global memory requirements 

Lastly, we give an estimation of the memory requirements 
with respect to the simulation size and simulation time. Most of 
the global memory is occupied by the neighbor list and the heat 
current data. For the neighbor list, the required global mem- 
ory scales with the number of particles N as2 N/IQ^ GB if the 
maximum number of neighbor particles for one particle is less 
than 512. For the heat current data, the required global memory 
scales with the number of production steps A^^ as 1.2 Np/lO^ 



GB and 2.4 Np/lO GB for single and double precisions, re- 
spectively. Thus, it is quite safe to perform a simulation with 
the system size as large as N - 10^ and the number of produc- 
tion steps as large as A^^ = 10*^ for Tesla M2070, which has 6 
GB of global memory. The Green-Kubo method rarely needs 
a simulation domain size as large as one million particles. Re- 
garding the simulation time, if 10*^ heat current data are not 
enough to obtain a well converged HCACF, we can simply per- 
form the same simulation for a number of times with different 
initial random velocities and average the resulting HCACFs to 
get a better one. 

4. Performance measurements 

In this section, we measure the performance of our GPU code 
running on a Tesla M2070 and compare it with that of our CPU 
code running on an Intel Xeon X5650. The CPU code is im- 
plemented in C/C++ and is compiled using g++ with the 03 
optimization mode. Newton's third law is also used to save un- 
necessary calculations for the CPU code. 

4.1. The evolution part 

In this subsection, we evaluate the performance of the evolu- 
tion part, with an emphasis on the comparison of the two force 
evaluation schemes. To be specific, we only present the results 
for the production stage, where the heat current data need to 
be calculated and recorded. The results for the equilibration 
stage are similar. We test the performance for the evolution part 
by measuring the computation time for 1000 time steps. The 
computational speed is defined as the product of the number of 
particles and the number of steps divided by the computation 
time. In the literature IHS], the inverse of this quantity is also 
used to evaluate the computational speed. The speedup factor 
is defined to be the computation time for the CPU over that for 
the GPU. The system size used for our evaluation spans from 
a few hundred to 64 thousand. We also considered different 
cutoff radii, which determine the maximal number of neighbor 
particles N„,. 

4. 1.1. The LJ potential 

The performances for the LJ potential are presented in Fig. 
[1] where the computational speeds and the speedup factors are 
plotted against the simulation size. 

For our CPU code, the LJ potential with N,„ = 128 executes 
at a speed of about 5.0 x 10^ particle ■ step / second, or equiv- 
alently, 2.0 //s / (particle ■ step). The computational speed de- 
creases with the increasing of the maximal number of neighbor 
particles A^,„. For N,,, - 512, the computational speed slows 
down to about 9.0 x lO'* particle ■ step / second. For relatively 
small systems (A^ < lO''), the computation speed is nearly inde- 
pendent of the simulation size, indicating a good linear-scaling 
dependence of the computation time on the simulation size. 
However, the linear-scaling behavior is not well preserved for 
relatively large systems (A^ > 10^). This is probably due to the 
more expensive memory operations for larger data arrays asso- 
ciated with larger simulation size. We will come to this problem 
when we discuss the RI potential later 
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Figure 1: (Color online) Performance evaluation for the evolution part in the production stage for the LJ potential: (a-c) computation speeds for 
both the CPU code and the GPU code with different force evaluation schemes; (d-f) speedup factors of the GPU code over the CPU code; (g-i) 
relative speedup factors of the block-scheme over the thread-scheme. The maximal number of neighbor particles in the simulations (128, 256 or 
512) are indicated in each subplot. 
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Figure 2: (Color online) Performance evaluation for the evolution part in the production stage for the RI potential: (a-c) computation speeds for 
both the CPU code and the GPU code with different force evaluation schemes; (d-f) speedup factors of the GPU code over the CPU code; (g-i) 
relative speedup factors of the block-scheme over the thread-scheme. The maximal number of neighbor particles in the simulations (128, 256 or 
512) are indicated in each subplot. 
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Our GPU implementation achieves high performance and 
large speedup factors. By using the thread-scheme, the com- 
putational speed can be as high as 3.5 x 10^ particle ■ step / sec- 
ond, or equivalently, 0.029 fj.s / (particle • step) for Nm = 128 
and = 8000, which gives a speedup factor of about 70. For 
N,„ =512 and = 8000, the speedup factor can reach 95 by 
using the thread-scheme. 

From Fig. [T](a-c) we can see that the computational speeds 
for the thread-scheme and the block-scheme saturate at different 
simulation sizes. As discussed before, the number of invoked 
blocks for the force evaluation kernel in the thread-scheme is 
[N/S I,]. With a block size of Si, - 128, the number of blocks 
only reaches the upper bound of the number of resident blocks 
in the GPU when = 16384. This explains why the perfor- 
mance for the thread-scheme saturates at about = 10"*. In 
contrast, the number of invoked blocks for the force evaluation 
kernel in the block-scheme is the number of particles, and the 
block-scheme can attain its peak performance with only a few 
hundred particles. 

There is always a crossover of the performances for the two 
force evaluation schemes. For the cases of N„, = 128, 256 and 
512, the simulation sizes at which the crossovers take place 
are around 1000, 3000 and 5000, respectively. From Fig. [T] 
(g-i) we can see that while the thread-scheme is a little faster 
when A^ < 10"*, the block-scheme is several times faster when 
A^ > 10"*. The smaller the system, the higher the relative 
speedup factor of the block-scheme over the thread-scheme. 
This makes the block-scheme more preferable for thermal con- 
ductivity calculation using the Green-Kubo method, where we 
rarely need to consider system with more than a few thousand 
particles. There are two main reasons for the superior perfor- 
mance of the thread-scheme over the block-scheme for large 
systems when saturation is obtained for both schemes. The first 
is that in the block-scheme, the total force and heat current for 
each particle has to be accumulated by binary reduction, during 
which only a portion of threads are used to do the calculation. 
The second is that there is more global memory access for the 
block-scheme. In the block-scheme, the positions r,- and veloci- 
ties V, as used in lines 5 and 6 of Algorithin|3]need to be trans- 
fered from global memory to all the threads in the block cor- 
responding to the heading particle /. In contrast, in the thread- 
scheme, the positions and velocities for a heading particle need 
only to be transfered from global memory to a single thread. 

From Fig. [I](d-f) we can see that while the speedup factors 
for the thread-scheme nearly do not vary with the increasing of 
A^m, the speedup factors for the block-scheme increase signif- 
icantly with the increasing of N,„. The computation time for 
the thread-scheme is dominated by the for loop in lines 4-12 of 
Algorithm |2] and scales nearly linearly with respect to N,„, as 
in the case of the CPU code. In contrast, in the block-scheme, 
a significant portion of computation time is spent on the extra 
global memory access of positions and velocities and the binary 
reduction operations for force and heat current. The amount of 
the extra global memory access scales with A^ and the number of 
binary reduction operations is determined by 5;,, both of which 
do not scale with A^,„. Thus, the thread-scheme performs better 
and better with the increasing of N,„. 



4.1.2. The RI potential 

The RI potential is much more computationally intensive 
than the LJ potential. One can see from Fig. |2] (a) that for 
Nm - 128, the CPU computational speed for the RI potential 
is about 6x10'* particle • step / second, or equivalently, 17 //s 
/ (particle ■ step), which is about 12 times slower than that for 
the LJ potential with the same N,„- Thus, compared to the LJ 
potential, the computation time for the RI potential would be 
dominated by the actual floating point operations, and this can 
help to understand why the linear-scaling behavior with respect 
to the simulation size is preserved much better for the RI poten- 
tial (Fig. |2](a-c)) than for the LJ potential (Fig. [r|(a-c)). 

As for the performances for the GPU code, the results for 
the RI potential are even more impressive compared with the 
LJ potential. The speedup factors can be as large as 300 for the 
thread-scheme. The higher acceleration rate for the RI potential 
compared to that for the LJ potential results from the higher 
arithmetic intensity of RI potential. 

The testing results for the RI potential exhibit similar features 
as in the case of the LJ potential, which can be listed as follows: 

1. The performance for the thread-scheme saturates only 
when A^ > lO'*, while that for the block-scheme saturates 
for a few hundred particles. 

2. For each N,„, there is a crossover point for the performance 
curves, before which the block-scheme performs better, 
and after which the opposite is true. The simulation sizes 
associated with the crossover points are around 2000, 5000 
and 7000 for A^,„ = 128, 256 and 512, respectively. 

3. The performance for the thread-scheme is weakly depen- 
dent on the value of N,„. In contrast, the average speedup 
factor for the block-scheme varies from 150 to 200 to 250 
when Nm increases from 128 to 256 to 512. 

All these points can be explained similarly as in the case of the 
LJ potential. 

4.2. Neighbor list construction 

The computation times for constructing the neighbor list in 
the CPU and the GPU are compared in Fig. [3] The computation 
time for the CPU implementation scales quadratically with the 
simulation size, as expected. Although our GPU implementa- 
tion is nearly a direct translation of the CPU implementation, its 
computation time scales linearly for A^ < 10"^ and only begin to 
scale quadratically for A^ > 10"*. Furthermore, for A^ < lO"*, the 
computation time for the neighbor list construction is less than 
that of 10 production steps without neighbor list update in the 
block-scheme for LJ potential with A^,„ = 128. As a comparison, 
the computation time for the neighbor list construction reported 
by Anderson ef aZ jstl, who use a cell decomposition approach, 
is about 10 times larger than their reported computation time for 
the force evaluation with one step. Thus, for A^ < 10"*, the sim- 
ple neighbor list construction method as given in Algorithm [1] 
can lead to rather good performance. However, for larger sys- 
tems, the cell decomposition approach will definitely perform 
better. 
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4.3. HCACF calculation 

Figure IHpresents the computation times for the HCACF cal- 
culation in the CPU and the GPU. The calculation of HCACF 
for the case of A^^. = 5 x lO'* and A^^ = 10** takes up more than 
one day using a CPU but only half an hour using a GPU. From 
the previous discussion, we can see that the evolution part of 
the MD simulation achieves more than two orders of magni- 
tude speedup. If we do not implement this post-processing part 
in the GPU as well, the speedup factor for the whole program 
will be much lower than that for the evolution part alone. For 
example, in the next section, we will apply the GPU code in the 
block-scheme to calculate the thermal conductivity of PbTe at 
a given temperature using the following values of the relevant 
parameters: = 512, Nm - 512 and Np - 10*^. The com- 
putational speed is 3.4 X 10^ particle ■ step / second by using 
the block-scheme (Fig. |2](c)), and the computation time for the 
evolution part would be less than 5 hours, which is even much 
less than that for the HCACF calculation in CPU. 



Figure 3: Computation times for the neighbor list construction in the 
CPU and the GPU as a function of the simulation size. The compu- 
tation time of 10 production steps without neighbor list update in the 
block-scheme for LJ potential with A',,, = 128 is also presented for 
comparison. 
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Figure 4: (Color online) Computation times for the HCACF calcula- 
tions in CPU and GPU as functions of the production time steps N,,. 
The number of data points for HCACF is chosen to be A'^ = 5 x 10'*. 



5. Validation 

In this section, we validate our GPU implementation by 
studying the lattice thermal conductivity of solid argon and bulk 
PbTe, using the LJ potential and the Rl potential, respectively. 

5.7. Determining the lattice constant at zero pressure 

We only consider systems without external pressure (0 Pa). 
The correct determination of the lattice constant is crucial for 
the correct prediction of the lattice thermal conductivity. In 
fact, the under-prediction of the lattice thermal conductivities 
of solid argon il9fl compared to experimental data results from 
an overprediction of the lattice constants, which is corrected 
by later studies The zero-pressure lattice constant for 

non-bonded potential can be obtained by using an NPT ensem- 
ble to control the pressure as well as the temperature of a suf- 
ficiently large system with a large cutoff" radius for force evalu- 
ation. The calculated lattice constants at diff'erent temperatures 
for solid argon and PbTe are shown in Fig. |5] For solid ar- 
gon, the results by McGaughey et al ll20ll are also presented for 
comparison. The lattice constants at zero temperature are ob- 
tained from the cohesive energy curves. For solid argon, the 
zero-temperature lattice constant is calculated to be 5.25 A. As 
a comparison, the one obtained by McGaughey et al lEoll is 5.24 
Aand the experimental value |36] is 5.30 A. For PbTe, the zero- 
temperature lattice constant is calculated to be 6.52 A, which 
is the same as that obtained by Qiu et al Bill . The lattice con- 
stants for PbTe at elevated temperatures also exhibit a linear- 
dependence behavior on temperature in the range of 300-700 
K, from which we can deduce a well defined value of the ther- 
mal expansion coefficient, 2.30x1 0"^ ' which is comparable 
to the experimental value 2.04 x 10"^ K ' . 



5.2. Results for lattice thermal conductivities 

After determining the lattice constants, we can calculate the 
zero-pressure lattice thermal conductivities at different temper- 
atures. For argon, the cutoff" radius for force evaluation is cho- 
sen to be Rc - 3cr and the time step is chosen to 6t - 2 fs. 
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Figure 5: (Color online) Zero-pressure lattice constants for solid argon (a) and PbTe (b) at different temperatures. For solid argon, the results by 
McGaugiiey et al (23l are also presented for comparison. 



We firstly equilibrate the simulated system in the NVT ensem- 
ble for Ne - 10^ steps and then evolve the system in the NVE 
ensemble for Np - 10^ steps. During the production stage, 
heat current data are recorded at each step. For PbTe, the cor- 
responding parameters are chosen to be = 16 A, 6t - 0.2 
fs, Ne — 10^ and Np — 10*^. Another parameter which is only 
relevant for PbTe, namely, the electrostatic damping factor, is 
chosen to be cc = 0.2 A which is a reasonble choice as sug- 



gestted by Fennell et al 112711 . 



5.2.7. Solid argon 

Figure |6] (a-e) gives the calculated results for HCACFs and 
RTCs at different temperatures for solid argon using the GPU 
code in the block-scheme. For all the temperatures, well con- 
verged HCACF and RTC can always be obtained, as long as 
we collect sufficiently many heat current data. The curves pre- 
sented in Fig |6] (a-e) are obtained by setting the number of pro- 
duction steps to be Np = 10^. Since the curves are so smooth, 
we need not do any fitting to obtain a well defined value of ther- 
mal conductivity. 

Figure |6] (f) compares our simulation results with those re- 
ported by McGaughey et al lE5l . We have tested the finite size 
effects and found that a simulation size of = 5x5x5x4 = 500 
is sufficient. It can be seen that our results agree well with 
theirs. Since their results are well established, this agreement 
provides a strong evidence of the correctness of our program. 

5.2.2. PbTe 

The calculated HCACFs and RTCs at different temperatures 
for PbTe using the GPU code in the block-scheme are presented 
in Fig |7] (a-e). For PbTe, a number of 10^ production steps is 
not sufficient to obtain smooth curves for HCACF and RTC. 
The curves presented in Fig Q (a-e) are obtained by setting the 
number of production steps to be Np - 10**. It can be seen 
that even if there exists high frequency oscillations (caused by 
optical phonons in the HCACF, the RTC still exhibits a 
very smooth behavior as long as we collect sufficiently many 
heat current data. 



Figure|7](f) presents our calculated lattice thermal conductiv- 
ities of PbTe at different temperatures. We have tested the size 
effects and found that a simulation size ofA^ = 4x4x4x8 = 512 
is sufficient. Our results agree well with the original results 
obtained by Qiu et al llsoll . Both results agree fairly with the 
experimental data 138 1 . 



6. Conclusions 



In conclusion, we presented in detail the development and 
optimization of a molecular dynamics simulation program fully 
implemented in the GPU, which can calculate the lattice ther- 
mal conductivity using the Green-Kubo formula. For the most 
time-consuming part, the force evaluation part, we compared 
two alternative approaches, a thread-scheme where the total 
force for a particle is accumulated in a single thread and a block 
scheme where the pair forces for a particle is distributively cal- 
culated in different threads within a block and summed up using 
shared memory to obtain the total force of the given particle. 
For both LJ and Rl potentials, the block-scheme outperforms 
the thread-scheme for smaller systems. This makes the block- 
scheme particularly preferable for thermal conductivity calcula- 
tions using the Green-Kubo approach, which is more demand- 
ing on the simulation time rather than the simulation size. The 
correctness of our implementation is validated by calculating 
the lattice thermal conductivities of solid argon and PbTe. 

For large systems, the speedup factors obtained reach about 
one hundred and three hundred for LJ and Rl potentials, re- 
spectively. The higher acceleration rate for the Rl potential 
compared with that for the LJ potential results from its higher 
arithmetic intensity, defined as the number of arithmetic oper- 
ations divided by the number of memory operations. Thus, we 
would expect that interaction potentials with higher arithmetic 
intensity, such as the bond-order potential, would lead to even 
higher speedup factors. 
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Figure 6: (Color online) HCACFs and RTCs (a-e) of solid argon as a function of the correlation time and the converged lattice thermal conductivity 
as a function of temperature (f). In (f), the MD results obtained by McGaughey et al (20I1 are also presented for comparison. 
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Figure 7: (Color online) HCACFs and RTCs (a-e) of PbTe as a function of the correlation time and the converged lattice thermal conductivity as 
a function of temperature (f). In (f), experimental data (3^ and the MD results obtained by Qiu et al (30I1 are also presented for comparison. 
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